{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Name: Md Mintu Miah, ID: 1001405116"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# IMDB-sentiment Analysis Using Naive Bayes Classifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Test classification is done for the purpose of finding tags or catagories of the text according to their contents. In this analysis, the data set is a collection of 50,000 reviews from IMDB. I have taken the process data from https://www.kaggle.com/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews/data and orginal data is available in here http://ai.stanford.edu/~amaas/data/sentiment/. The purpose of this analysis was exploring the naive bayes classification with text data. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Import the data and explore the contents"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Read The data\n",
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.naive_bayes import MultinomialNB"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# Import the data and see the data type"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
review
\n",
"
sentiment
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
One of the other reviewers has mentioned that ...
\n",
"
positive
\n",
"
\n",
"
\n",
"
1
\n",
"
A wonderful little production. <br /><br />The...
\n",
"
positive
\n",
"
\n",
"
\n",
"
2
\n",
"
I thought this was a wonderful way to spend ti...
\n",
"
positive
\n",
"
\n",
"
\n",
"
3
\n",
"
Basically there's a family where a little boy ...
\n",
"
negative
\n",
"
\n",
"
\n",
"
4
\n",
"
Petter Mattei's \"Love in the Time of Money\" is...
\n",
"
positive
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" review sentiment\n",
"0 One of the other reviewers has mentioned that ... positive\n",
"1 A wonderful little production.
The... positive\n",
"2 I thought this was a wonderful way to spend ti... positive\n",
"3 Basically there's a family where a little boy ... negative\n",
"4 Petter Mattei's \"Love in the Time of Money\" is... positive"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data=pd.read_csv('C:/Users/mxm5116/Desktop/Data Mining/IMDB Dataset.csv')\n",
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(50000, 2)\n"
]
}
],
"source": [
"# Check the shape of the data\n",
"print(data.shape)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
review
\n",
"
sentiment
\n",
"
\n",
" \n",
" \n",
"
\n",
"
count
\n",
"
50000
\n",
"
50000
\n",
"
\n",
"
\n",
"
unique
\n",
"
49582
\n",
"
2
\n",
"
\n",
"
\n",
"
top
\n",
"
Loved today's show!!! It was a variety and not...
\n",
"
positive
\n",
"
\n",
"
\n",
"
freq
\n",
"
5
\n",
"
25000
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" review sentiment\n",
"count 50000 50000\n",
"unique 49582 2\n",
"top Loved today's show!!! It was a variety and not... positive\n",
"freq 5 25000"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Now lets, see the summary of the data set\n",
"data.describe()"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"positive 25000\n",
"negative 25000\n",
"Name: sentiment, dtype: int64"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check the positive and negative number of sentiment\n",
"data['sentiment'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# a. Divide the dataset as train,and test¶ data sets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# First clear and normalized the data and divide again as normalized train, and test data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Now clean the text"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Import library\n",
"from bs4 import BeautifulSoup\n",
"import re,string,unicodedata\n",
"# Removing the html strips\n",
"def strip_html(text):\n",
" soup = BeautifulSoup(text, \"html.parser\")\n",
" return soup.get_text()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Remove the square brackets\n",
"def remove_between_square_brackets(text):\n",
" return re.sub('\\[[^]]*\\]', '', text)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Remoove the noisy text\n",
"def denoise_text(text):\n",
" text = strip_html(text)\n",
" text = remove_between_square_brackets(text)\n",
" return text\n",
"#Apply function on review column\n",
"data['review']=data['review'].apply(denoise_text)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Now remove special character and apply function for the review colums\n",
"def remove_special_characters(text, remove_digits=True):\n",
" pattern=r'[^a-zA-z0-9\\s]'\n",
" text=re.sub(pattern,'',text)\n",
" return text\n",
"data['review']=data['review'].apply(remove_special_characters)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# Streaming the text\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"import nltk\n",
"def simple_stemmer(text):\n",
" ps=nltk.porter.PorterStemmer()\n",
" text= ' '.join([ps.stem(word) for word in text.split()])\n",
" return text\n",
"#Apply function on review column\n",
"data['review']=data['review'].apply(simple_stemmer)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
review
\n",
"
sentiment
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
one of the other review ha mention that after ...
\n",
"
positive
\n",
"
\n",
"
\n",
"
1
\n",
"
A wonder littl product the film techniqu is ve...
\n",
"
positive
\n",
"
\n",
"
\n",
"
2
\n",
"
I thought thi wa a wonder way to spend time on...
\n",
"
positive
\n",
"
\n",
"
\n",
"
3
\n",
"
basic there a famili where a littl boy jake th...
\n",
"
negative
\n",
"
\n",
"
\n",
"
4
\n",
"
petter mattei love in the time of money is a v...
\n",
"
positive
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" review sentiment\n",
"0 one of the other review ha mention that after ... positive\n",
"1 A wonder littl product the film techniqu is ve... positive\n",
"2 I thought thi wa a wonder way to spend time on... positive\n",
"3 basic there a famili where a littl boy jake th... negative\n",
"4 petter mattei love in the time of money is a v... positive"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" review sentiment score\n",
"0 one of the other review ha mention that after ... positive 1\n",
"1 A wonder littl product the film techniqu is ve... positive 1\n",
"2 I thought thi wa a wonder way to spend time on... positive 1\n",
"3 basic there a famili where a littl boy jake th... negative 0\n",
"4 petter mattei love in the time of money is a v... positive 1"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Now take the positive sentiment data from training set\n",
"train_data=data[:4000]\n",
"positive_docs=train_data.loc[train_data['sentiment']!=0]\n",
"positive_docs.head()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['A wonder littl product the film techniqu is veri unassum veri oldtimebbc fashion and give a comfort and sometim discomfort sens of realism to the entir piec the actor are extrem well chosen michael sheen not onli ha got all the polari but he ha all the voic down pat too you can truli see the seamless edit guid by the refer to william diari entri not onli is it well worth the watch but it is a terrificli written and perform piec A master product about one of the great master of comedi and hi life the realism realli come home with the littl thing the fantasi of the guard which rather than use the tradit dream techniqu remain solid then disappear It play on our knowledg and our sens particularli with the scene concern orton and halliwel and the set particularli of their flat with halliwel mural decor everi surfac are terribl well done',\n",
" 'I thought thi wa a wonder way to spend time on a too hot summer weekend sit in the air condit theater and watch a lightheart comedi the plot is simplist but the dialogu is witti and the charact are likabl even the well bread suspect serial killer while some may be disappoint when they realiz thi is not match point 2 risk addict I thought it wa proof that woodi allen is still fulli in control of the style mani of us have grown to lovethi wa the most Id laugh at one of woodi comedi in year dare I say a decad while ive never been impress with scarlet johanson in thi she manag to tone down her sexi imag and jump right into a averag but spirit young womanthi may not be the crown jewel of hi career but it wa wittier than devil wear prada and more interest than superman a great comedi to go see with friend',\n",
" 'basic there a famili where a littl boy jake think there a zombi in hi closet hi parent are fight all the timethi movi is slower than a soap opera and suddenli jake decid to becom rambo and kill the zombieok first of all when your go to make a film you must decid if it a thriller or a drama As a drama the movi is watchabl parent are divorc argu like in real life and then we have jake with hi closet which total ruin all the film I expect to see a boogeyman similar movi and instead i watch a drama with some meaningless thriller spots3 out of 10 just for the well play parent descent dialog As for the shot with jake just ignor them',\n",
" 'petter mattei love in the time of money is a visual stun film to watch Mr mattei offer us a vivid portrait about human relat thi is a movi that seem to be tell us what money power and success do to peopl in the differ situat we encount thi be a variat on the arthur schnitzler play about the same theme the director transfer the action to the present time new york where all these differ charact meet and connect each one is connect in one way or anoth to the next person but no one seem to know the previou point of contact stylishli the film ha a sophist luxuri look We are taken to see how these peopl live and the world they live in their own habitatth onli thing one get out of all these soul in the pictur is the differ stage of loneli each one inhabit A big citi is not exactli the best place in which human relat find sincer fulfil as one discern is the case with most of the peopl we encounterth act is good under Mr mattei direct steve buscemi rosario dawson carol kane michael imperioli adrian grenier and the rest of the talent cast make these charact come alivew wish Mr mattei good luck and await anxious for hi next work']"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# make the list of positive sentiment\n",
"train_pos_reviews=positive_docs['review']\n",
"train_pos_voca=train_pos_reviews.values.tolist()\n",
"train_pos_voca[1:5]"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [],
"source": [
"# Join the positive sentiment with single dot\n",
"train_pos_voca='.'.join(train_pos_voca)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3978\n"
]
}
],
"source": [
"# Now calculate the number of positive documents having the\n",
"words=[\"the\"]\n",
"sentences = train_pos_voca\n",
"count=0\n",
"for sentence in sentences :\n",
" for word in words :\n",
" if word in sentence :\n",
" count=count+1\n",
" #print(count)\n",
" #print(count)\n",
"num_of_pos_documents_containing_the=count\n",
"print(num_of_pos_documents_containing_the)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"4000\n"
]
}
],
"source": [
"# Find the totl positive documents in training data set\n",
"num_of_all_pos_documents=positive_docs['review'].count()\n",
"print(num_of_all_pos_documents)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.9945\n"
]
}
],
"source": [
"# Now calculate P[“the” | Positive] = # of positive documents containing “the” / num of all positive review documents\n",
"probability_0f_the_in_positive_docs=num_of_pos_documents_containing_the/num_of_all_pos_documents\n",
"print(probability_0f_the_in_positive_docs)"
]
},
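{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ratio above can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the notebook's own method: the helper `doc_probability`, the sample reviews in `toy_pos_docs`, and the whole-token matching (rather than the substring test used above) are all hypothetical choices made for the example.\n",
"\n",
"```python\n",
"def doc_probability(word, docs):\n",
"    # Fraction of documents whose whitespace-separated tokens include `word`\n",
"    if not docs:\n",
"        return 0.0\n",
"    hits = sum(1 for doc in docs if word in doc.split())\n",
"    return hits / len(docs)\n",
"\n",
"# Hypothetical toy reviews; 'there' does not count as 'the' with token matching\n",
"toy_pos_docs = ['the film is great', 'there is nothing bad here', 'loved the cast']\n",
"p_the = doc_probability('the', toy_pos_docs)\n",
"print(p_the)\n",
"```\n",
"\n",
"Matching whole tokens avoids counting words such as `there` or `other` as occurrences of `the`, which a plain substring test would do."
]
},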
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# d.\tCalculate accuracy using dev dataset \n",
"\t# Conduct five fold cross validation\n"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"10"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Convert the data in vector fpormate\n",
"tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))\n",
"tf_idf_train = tf_idf_vect.fit_transform(X_train)\n",
"tf_idf_test = tf_idf_vect.transform(X_test)\n",
"\n",
"alpha_range = list(np.arange(0,10,1))\n",
"len(alpha_range)"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\mxm5116\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:507: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10\n",
" 'setting alpha = %.1e' % _ALPHA_MIN)\n",
"C:\\Users\\mxm5116\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:507: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10\n",
" 'setting alpha = %.1e' % _ALPHA_MIN)\n",
"C:\\Users\\mxm5116\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:507: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10\n",
" 'setting alpha = %.1e' % _ALPHA_MIN)\n",
"C:\\Users\\mxm5116\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:507: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10\n",
" 'setting alpha = %.1e' % _ALPHA_MIN)\n",
"C:\\Users\\mxm5116\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:507: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10\n",
" 'setting alpha = %.1e' % _ALPHA_MIN)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 0.8233\n",
"1 0.8845749999999999\n",
"2 0.879425\n",
"3 0.8753749999999998\n",
"4 0.8727500000000001\n",
"5 0.8703\n",
"6 0.8679499999999999\n",
"7 0.86595\n",
"8 0.8638\n",
"9 0.86205\n"
]
}
],
"source": [
"# take different values of alpha in cross validation and finding the accuracy score\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"\n",
"alpha_scores=[]\n",
"\n",
"for a in alpha_range:\n",
" clf = MultinomialNB(alpha=a)\n",
" scores = cross_val_score(clf, tf_idf_train, y_train, cv=5, scoring='accuracy')\n",
" alpha_scores.append(scores.mean())\n",
" print(a,scores.mean())"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Plot b/w misclassification error and CV mean score.\n",
"import matplotlib.pyplot as plt\n",
"\n",
"MSE = [1 - x for x in alpha_scores]\n",
"\n",
"\n",
"optimal_alpha_bnb = alpha_range[MSE.index(min(MSE))]\n",
"\n",
"# plot misclassification error vs alpha\n",
"plt.plot(alpha_range, MSE)\n",
"\n",
"plt.xlabel('hyperparameter alpha')\n",
"plt.ylabel('Misclassification Error')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"optimal_alpha_bnb\n",
"\n",
"# For alpha =1, we have got minimum misscalculation error"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# e.\tDo following experiments\n",
"\tCompare the effect of Smoothing\n",
"\tDerive Top 10 words that predicts positive and negative class \n",
" •\tP[Positive| word] \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Effects of non-smoothing and smoothing "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# We have already got the effects of smoothing and non-smoothing. When we have considered alpha=0 (non-smoothing), we got the accuracy 82.33% whereas with smoothing our accuacy is always greater than non-smoothing conditions. We have got best smoothing parapmeter alpha=1 with hoighest accuracy 88.46%"
]
},
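{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to see why an unsmoothed estimate can hurt is a toy calculation. The sketch below is an assumption-laden illustration: `smoothed_likelihood` and `toy_neg_docs` are hypothetical, and the formula is a Bernoulli-style document-frequency estimate with add-alpha smoothing, which is simpler than the term-count smoothing that `MultinomialNB` applies internally.\n",
"\n",
"```python\n",
"def smoothed_likelihood(word, docs, alpha):\n",
"    # (documents containing word + alpha) / (number of documents + 2 * alpha)\n",
"    hits = sum(1 for doc in docs if word in doc.split())\n",
"    return (hits + alpha) / (len(docs) + 2 * alpha)\n",
"\n",
"# Hypothetical toy reviews in which the word 'wonderful' never appears\n",
"toy_neg_docs = ['a dull plot', 'bad acting and a dull script']\n",
"unsmoothed = smoothed_likelihood('wonderful', toy_neg_docs, alpha=0)\n",
"smoothed = smoothed_likelihood('wonderful', toy_neg_docs, alpha=1)\n",
"print(unsmoothed, smoothed)\n",
"```\n",
"\n",
"With `alpha=0` the unseen word gets probability zero, so any review containing it would receive a zero likelihood for that class regardless of its other words; with `alpha=1` the estimate stays small but nonzero."
]
},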
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package stopwords to\n",
"[nltk_data] C:\\Users\\mxm5116\\AppData\\Roaming\\nltk_data...\n",
"[nltk_data] Package stopwords is already up-to-date!\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Now lets see the highest positive and negative words that has highest sentiment prediction capacity\n",
"import re\n",
"import string\n",
"import nltk\n",
"from nltk.corpus import stopwords\n",
"from nltk.stem import PorterStemmer\n",
"from nltk.stem.wordnet import WordNetLemmatizer\n",
"nltk.download('stopwords')"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"# Now we will remove stop words as it does not carry significant meaning and will store positive and negative word for selections\n",
"stop = set(stopwords.words('english')) \n",
"sno = nltk.stem.SnowballStemmer('english') \n",
"def cleanhtml(sentence): \n",
" cleanr = re.compile('<.*?>')\n",
" cleantext = re.sub(cleanr, ' ', sentence)\n",
" return cleantext\n",
"def cleanpunc(sentence): \n",
" cleaned = re.sub(r'[?|!|\\'|\"|#]',r'',sentence)\n",
" cleaned = re.sub(r'[.|,|)|(|\\|/]',r' ',cleaned)\n",
" return cleaned\n",
"i=0\n",
"str1=' '\n",
"final_string=[]\n",
"all_positive_words=[] \n",
"all_negative_words=[] \n",
"s=''\n",
"for sent in data['review'].values:\n",
" filtered_sentence=[]\n",
" sent=cleanhtml(sent) \n",
" for w in sent.split():\n",
" for cleaned_words in cleanpunc(w).split():\n",
" if((cleaned_words.isalpha()) & (len(cleaned_words)>2)): \n",
" if(cleaned_words.lower() not in stop):\n",
" s=(sno.stem(cleaned_words.lower())).encode('utf8')\n",
" filtered_sentence.append(s)\n",
" if (data['score'].values)[i] == 1: \n",
" all_positive_words.append(s) \n",
" if(data['score'].values)[i] == 0:\n",
" all_negative_words.append(s) \n",
" else:\n",
" continue\n",
" else:\n",
" continue \n",
" \n",
" str1 = b\" \".join(filtered_sentence) \n",
" \n",
" final_string.append(str1)\n",
" i+=1"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3062885\n",
"3002812\n"
]
}
],
"source": [
"total_positive_words = len(all_positive_words)\n",
"total_negative_words = len(all_negative_words)\n",
"print(total_positive_words)\n",
"print(total_negative_words)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"apw = random.sample(all_positive_words, 10000)\n",
"anw = random.sample(all_negative_words, 10000)\n",
"freq_negative_words = {x:anw.count(x) for x in anw}\n",
"freq_positive_words = {x:apw.count(x) for x in apw}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Lets see positive sentiment first"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"